The Anchors Hierarchy: Using the Triangle Inequality to Survive High Dimensional Data

Author

  • Andrew W. Moore
Abstract

This paper is about the use of metric data structures in high-dimensional or non-Euclidean space to permit cached sufficient statistics accelerations of learning algorithms. It has recently been shown that for less than about 10 dimensions, decorating kd-trees with additional "cached sufficient statistics" such as first and second moments and contingency tables can provide satisfying acceleration for a very wide range of statistical learning tasks such as kernel regression, locally weighted regression, k-means clustering, mixture modeling and Bayes Net learning. In this paper, we begin by defining the anchors hierarchy, a fast data structure and algorithm for localizing data based only on a triangle-inequality-obeying distance metric. We show how this, in its own right, gives a fast and effective clustering of data. But more importantly we show how it can produce a well-balanced structure similar to a Ball-Tree (Omohundro, 1991) or a kind of metric tree (Uhlmann, 1991; Ciaccia, Patella, & Zezula, 1997) in a way that is neither "top-down" nor "bottom-up" but instead "middle-out". We then show how this structure, decorated with cached sufficient statistics, allows a wide variety of statistical learning algorithms to be accelerated even in thousands of dimensions.

Andrew W. Moore, Carnegie Mellon University, Pittsburgh, PA 15213. February 18, 2000.

1 Cached Sufficient Statistics

This paper is not about new ways of learning from data, but instead about how to allow a wide variety of current learning methods, case-based tools, and statistical methods to scale up to large datasets in a computationally tractable fashion. A cached sufficient statistics representation is a data structure that summarizes statistical information from a large dataset. For example, human users, or statistical programs, often need to query some quantity (such as a mean or variance) about some subset of the attributes (such as size, position and shape) over some subset of the records. When this happens, we want the cached sufficient statistics representation to intercept the request and, instead of answering it slowly by database accesses over huge numbers of records, answer it immediately.
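The idea above can be sketched concretely. The following is a minimal, hypothetical 1-D illustration (not the paper's algorithm, and with invented names such as `Node` and `query_stats`): each node of a metric tree caches the count, sum, and sum of squares of the points beneath it, so a mean or variance over a queried ball of points can be assembled from whole-subtree caches, using the triangle inequality to skip subtrees entirely rather than scanning individual records.

```python
class Node:
    """A metric-tree node decorated with cached sufficient statistics."""
    def __init__(self, points):
        # Cached sufficient statistics for all points under this node.
        self.n = len(points)
        self.sum = sum(points)
        self.sumsq = sum(x * x for x in points)
        # Pivot (center) and covering radius in the 1-D metric |x - y|.
        self.center = self.sum / self.n
        self.radius = max(abs(x - self.center) for x in points)
        if self.n > 1:
            points = sorted(points)
            mid = self.n // 2
            self.children = [Node(points[:mid]), Node(points[mid:])]
        else:
            self.children = []
            self.points = points

def query_stats(node, q, r):
    """Return (count, sum, sumsq) of points within distance r of q."""
    d = abs(q - node.center)
    if d - node.radius > r:      # triangle inequality: ball misses node entirely
        return (0, 0.0, 0.0)
    if d + node.radius <= r:     # ball contains node: answer from the cache
        return (node.n, node.sum, node.sumsq)
    if not node.children:        # leaf straddling the boundary: scan it
        inside = [x for x in node.points if abs(q - x) <= r]
        return (len(inside), sum(inside), sum(x * x for x in inside))
    totals = [query_stats(c, q, r) for c in node.children]
    return tuple(map(sum, zip(*totals)))

data = [0.5, 1.0, 1.5, 2.0, 8.0, 9.0, 9.5]
root = Node(data)
n, s, ss = query_stats(root, 1.0, 1.2)  # points within 1.2 of q = 1.0
mean = s / n
variance = ss / n - mean * mean
```

Here the query touches the cluster near 1.0 and prunes the cluster near 9.0 without visiting its points; in higher dimensions the same pruning logic applies with any metric obeying the triangle inequality.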

Similar articles

The Anchors Hierarchy: Using the triangle inequality to survive high dimensional data

This paper is about metric data structures in high-dimensional or non-Euclidean space that permit cached sufficient statistics accelerations of learning algorithms. It has recently been shown that for less than about 10 dimensions, decorating kd-trees with additional "cached sufficient statistics" such as first and second moments and contingency tables can provide satisfying acceleration...


On the metric triangle inequality

A non-contradictory axiomatic theory is constructed under the local reversibility of the metric triangle inequality. The obtained notion includes metric spaces as particular cases, and the generated metric topology is T$_{1}$-separated and, in general, non-Hausdorff.


On improving APIT algorithm for better localization in WSN

In Wireless Sensor Networks (WSNs), localization algorithms can be range-based or range-free. The Approximate Point in Triangle (APIT) test is a range-free approach. We propose a modification of the APIT algorithm, referred to as modified-APIT. We select suitable triangles with appropriate distances between anchors to reduce PIT test errors (edge effect and non-uniform placement of neighbours) in APIT a...


Using Triangle Inequality to Efficiently Process Continuous Queries on High-Dimensional Streaming Time Series

In many applications, it is important to quickly find, from a database of patterns, the nearest neighbors of high-dimensional query points that arrive in the system in streaming form. Treating each query point separately is inefficient: consecutive query points are often neighbors in the high-dimensional space, and intermediate results from processing one query should help the proc...


Outlier Detection for Robust Multi-dimensional Scaling

Multi-dimensional scaling (MDS) plays a central role in data-exploration, dimensionality reduction and visualization. State-of-the-art MDS algorithms are not robust to outliers, yielding significant errors in the embedding even when only a handful of outliers are present. In this paper, we introduce a technique to detect and filter outliers based on geometric reasoning. We test the validity of ...



Journal title:

Volume   Issue

Pages  -

Publication date: 2000